Antipattern: Specify Values in the Column Definition
Let's explore the insights we can derive from the antipattern 31 Flavors in this lesson.
We'll cover the following
Many people choose to specify the valid data values when they define a column. The column definition is part of the metadata, which is the definition of the table structure itself.
For example, we could define a
CHECK
constraint on the column. This constraint disallows any insert or update that would make the constraint false.
MySQL supports a non-standard data type called ENUM
that restricts the column to a specific set of values.
In MySQL’s implementation, we declare the values as strings, but internally the column is stored as the ordinal number of the string in the enumerated list. The storage is therefore compact, but when we sort a query by this column, the result is ordered by the ordinal value, not alphabetically by the string value. This may be unexpected behavior for us.
Other solutions include domains and user-defined types (UDTs). We can use these to restrict a column to a specific set of values and conveniently apply the same domain or data type to several columns within our database. But these features are not supported widely among brands of RDBMSs yet.
Finally, we could write a trigger that contains the set of permitted values and causes an error to be raised unless the status matches one of these values.
Baskin-Robbins 31 Ice Cream
All of these solutions share some disadvantages. The following sections describe some of these problems.
What was the middle one?#
Let’s suppose that we’re developing a user interface so that a user can edit bug reports. In order to make the interface guide a user to pick one of the valid status
values, we create a drop-down menu containing these values. How do we query the database for an enumerated list of values that are currently allowed in the status
column?
Our first instinct may be to query all the values currently in use with a simple query like the following:
However, if all the bugs are new, the previous query returns only NEW
. If we use this result to populate a user interface control with the status
of bugs, we could end up creating a chicken-and-egg situation; we can’t change a bug to any status other than those currently in use.
To get the complete list of permitted status
values, we need to query the definition of that column’s metadata. Most SQL databases support system views for these kinds of queries, but using them can be complex. For example, if we used MySQL’s ENUM
data type, we can use the following query to query the INFORMATION_SCHEMA
system views:
We can’t simply get the discrete enumeration values from the INFORMATION_SCHEMA
in a conventional result set. Instead, we get a string containing the definition of the check constraint or ENUM
data type. For example, the previous query in MySQL returns a column of type LONGTEXT
, with the value ENUM(’NEW’, ’IN PROGRESS’, ’FIXED’)
, including the parentheses, commas, and single quotes. We must write application code to parse this string and extract the individual quoted values before we can use them to populate a user interface control.
The queries needed to report check constraints, domains, or UDTs are progressively more complex. Most people choose the arduous path of manually maintaining a parallel list of values in the application code. This is an easy way for bugs to affect our project as application data becomes out of sync with the database metadata.
Adding a new flavor#
The most common alterations are to add or remove one of the permitted values. There’s no syntax to add or remove a value from an ENUM
or check constraint; we can only redefine the column with a new set of values. The following is an example of adding DUPLICATE
as one new status value in the MySQL ENUM
:
The previous definition of the column allowed NEW
, IN PROGRESS
, and FIXED
. This leads us back to the difficulty of querying the current set of values as described earlier.
Some database brands can’t change the definition of a column unless the table is empty. We may have to dump the contents of the table, redefine the table, and then import our saved data, making the table inaccessible in the meantime. This work is common enough that it has a name: ETL or “extract, transform, and load.” Other brands of database support restructuring a populated table with
ALTER TABLE
commands, but it can still be complex and expensive to perform these changes.
As a matter of policy, changing metadata — that is, changing the definition of tables and columns — should be infrequent and with attention to testing and quality assurance. If we need to change metadata to add or remove a value from an ENUM
, then we either have to skip the appropriate testing or spend a lot of software engineering effort on a short notice to make the change. Either way, these changes introduce risk and destabilize your project.
Old flavors never die#
If we make a value obsolete, we could upset historical data. For example, let’s see what happens when we change our quality control process to replace FIXED
with two stages, CODE COMPLETE
and VERIFIED
:
If we remove FIXED
from the enumeration, what do we do with bugs whose status was FIXED
? Should we advance all FIXED
bugs to VERIFIED
? Should we instead set obsolete values to null or to a default value?
We may have to keep an obsolete value that old rows reference. But then, how can we distinguish between obsolete values and exclude them from our user interface so that no one can set a bug’s status to the obsolete value?
Portability is hard#
Check constraints, domains, and UDTs are not supported uniformly among brands of SQL databases. The ENUM
data type is a proprietary feature in MySQL. Each brand of the database may have a different limit on the length of the list that we can give in a column definition. Trigger languages vary as well. These variations make it hard to choose a solution if we need to support multiple brands of databases.